Code
set.seed(1245124)
library(tidyverse)
library(knitr)
library(gganimate)
library(gifski)
library(kableExtra)
library(glue)
library(broom)
library(bibtex)
library(patchwork)set.seed(1245124)
library(tidyverse)
library(knitr)
library(gganimate)
library(gifski)
library(kableExtra)
library(glue)
library(broom)
library(bibtex)
library(patchwork)income <- read_csv("data/avg_daily_income.csv")
women <- read_csv("data/mean_years_in_school_women_25_34_years.csv")
income_longer <- income |>
pivot_longer(cols = `1800`:`2100`,
names_to = "Year",
values_to = "Income")
women_longer <- women |>
pivot_longer(cols = `1970`:`2015`,
names_to = "Year",
values_to = "years_in_school")full_data <- income_longer |>
inner_join(women_longer, by = join_by(Year, country))
full_data <- full_data |>
mutate(Region = fct_collapse(factor(country),
"North America" = c("Canada", "USA", "Mexico"),
"South America" = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador",
"Guyana", "Paraguay", "Peru", "Suriname", "Uruguay", "Venezuela"),
"Central America" = c("Antigua and Barbuda", "Bahamas", "Barbados", "Cuba", "Dominica",
"Dominican Republic", "Grenada", "Haiti", "Jamaica", "St. Lucia",
"St. Vincent and the Grenadines", "Trinidad and Tobago", "Belize",
"Costa Rica", "El Salvador", "Guatemala", "Honduras","Nicaragua",
"Panama"),
"Europe" = c("Andorra","Austria", "Belgium", "France", "Germany", "Luxembourg",
"Netherlands", "Switzerland", "UK", "Belarus", "Bulgaria", "Croatia",
"Czech Republic", "Estonia", "Hungary", "Latvia", "Lithuania", "Moldova",
"Poland", "Romania", "Russia", "Slovak Republic", "Ukraine", "Albania",
"Bosnia and Herzegovina", "Cyprus", "Greece", "Italy", "Malta", "Montenegro",
"Portugal", "Serbia", "Slovenia", "Spain", "North Macedonia", "Denmark",
"Finland", "Iceland", "Ireland", "Norway", "Sweden"),
"Middle East and North Africa" = c("Bahrain", "Egypt", "Iran", "Iraq", "Israel",
"Palestine", "Jordan", "Kuwait", "Lebanon", "Oman",
"Qatar", "Saudi Arabia", "UAE", "Yemen", "Turkey",
"Syria", "Armenia", "Azerbaijan", "Georgia",
"Afghanistan", "Libya", "Morocco", "Pakistan",
"Somalia", "Tunisia"),
"Asia" = c("Bangladesh", "Bhutan", "India", "Maldives", "Nepal",
"Sri Lanka", "Uzbekistan", "Turkmenistan", "Tajikistan", "Kyrgyz Republic",
"Kazakhstan", "Brunei", "Cambodia", "Indonesia", "Lao", "Malaysia",
"Myanmar", "Philippines", "Singapore", "Thailand", "Timor-Leste", "Vietnam",
"China", "Japan", "North Korea", "South Korea", "Mongolia", "Taiwan"),
"Oceania" = c("Australia", "Fiji", "Kiribati", "Marshall Islands", "New Zealand",
"Papua New Guinea", "Solomon Islands", "Tonga", "Vanuatu",
"Micronesia, Fed. Sts.", "Samoa"),
"Sub-Saharan Africa" = c("Algeria", "Benin", "Burkina Faso", "Cape Verde", "Chad",
"Comoros", "Cote d'Ivoire", "Djibouti", "Eritrea", "Gambia",
"Ghana", "Liberia", "Mali", "Mauritania", "Mauritius","Niger",
"Nigeria","Burundi", "Cameroon", "Central African Republic",
"Congo, Dem. Rep.", "Congo, Rep.", "Ethiopia", "Equatorial Guinea",
"Gabon", "Guinea", "Guinea-Bissau", "Kenya", "Rwanda",
"Seychelles","South Sudan", "Namibia", "Uganda",
"Sao Tome and Principe", "Senegal", "Sierra Leone", "Sudan",
"Togo","Angola", "Botswana", "Eswatini", "Lesotho", "Madagascar",
"Malawi", "Mozambique", "South Africa", "Tanzania", "Zambia",
"Zimbabwe")
)) In this analysis, we will use two variables from Gapminder: (Rosling 2017) average daily income, and average years in school of women ages 25-34. Our first dataset includes information on the average daily income in 2017 dollars from the year 1800 to 2100 (projected). Our second dataset includes the mean number of years enrolled in school for women between the ages 25-34 from 1970 to 2015. Each observation is uniquely identified by a different country. Our analysis will include the years shared by both datasets (1970-2015) and the shared countries (188 in common). Our datasets did not have any missing values or inconsistent data types. In addition, we decided to divide all of the countries in the dataset into 8 distinct regions in alignment with the “Official Listing of Countries by World Region: The Eight Groupings of the World by Location and Culture.” (Rosenberg August 28, 2024)
We hypothesize that nations with women between ages 25 and 34 that have spent more years in school will have a higher average daily income because women in this age range may have also completed graduate-level education.
full_data |>
group_by(country) |>
mutate(avg_income = mean(Income),
avg_school = mean(years_in_school)) |>
ggplot(aes(x = avg_school, y = avg_income, color = Region)) +
geom_point() +
labs(x = "Number of Years in School",
y = "Income",
title = "Income and Number of Years in School by Women across the World",
subtitle = "Income and School Years were Averaged across Country") +
theme_bw()In the scatterplot above, every dot on the graph represents a different country where they are colored by their corresponding region. The x-axis represents the average number of years that women are in school, where the y-axis displays the average daily income for a household. To properly display all the data from the years 1970 to 2015, the average of both variables was taken respective to the country and placed on the scatterplot above.
# animation functions found in R graph gallery page
# https://r-graph-gallery.com/package/gganimate.html
animation <- full_data |>
mutate(Year = as.numeric(Year)) |>
ggplot(aes(x = years_in_school, y = Income, color = Region)) +
geom_point() +
labs(x = "Number of Years in School",
y = "Income",
title = "Income and Number of Years in School by Women across the World 1970-2015",
subtitle = "Income and School Years across Country") +
theme_bw()
anim <- animation + transition_time(Year) +
ease_aes("linear")
anim\[Log_{10}(MeanIncome) = 0.3088 + 0.0959*MeanEducation \]
Because we took the base-10 logarithm of the mean income by country, a slope of 0.0959 indicates that income increases by estimate of 24.71% for every additional year of education for women ages 25-54. At 0 years of education, average income is approximately $2.04.
regression_model_data <-
full_data |>
group_by(country) |>
summarize(mean_inc = mean(Income),
mean_educ = mean(years_in_school))
lm_income_educ <- lm(log10(mean_inc) ~ mean_educ, data = regression_model_data)
kable(tidy(lm_income_educ))| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.3088025 | 0.0454792 | 6.789973 | 0 |
| mean_educ | 0.0959463 | 0.0058351 | 16.442946 | 0 |
The \(Log_{10}\) transformation of average daily income helped to linearize our model shown in the scatterplot below. Like the figure in 2.1, each dot on the graph represents a different country. The x-axis represents the average number of years that women are in school, where the y-axis displays the \(Log_{10}\) of the average daily income for a household. To properly display all the data from the years 1970 to 2015, the average of both variables was taken respective to the county and placed on the scatterplot above.
regression_model_data |>
ggplot(aes(x = mean_educ,
y = log10(mean_inc))) +
geom_point() +
geom_smooth(method = lm) +
labs(x = "Number of Years in School",
y = "Log10(Income)",
title = "Log Income and Number of Years in School for Women across the World",
subtitle = "Income and number of years in school were averaged across country") +
theme_bw()var_response <- var(log10(regression_model_data$mean_inc))
var_fitted <- var(lm_income_educ$fitted)
var_residuals <- var(lm_income_educ$residuals)
var_table <- data.frame(c("Variance in Response", "Variance in Fitted Values", "Variance in Residuals"),
c(var_response, var_fitted, var_residuals))
var_table |>
kable(col.names = c("Metric", "Variance")) |>
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover")) |>
add_header_above(c("Variability" = 2)) |>
row_spec(0, bold = TRUE, color = "black")| Metric | Variance |
|---|---|
| Variance in Response | 0.1810507 |
| Variance in Fitted Values | 0.1072610 |
| Variance in Residuals | 0.0737897 |
var_prop_explained <- round(var_fitted/var_response,4) * 100
var_prop_unexplained <- round(var_residuals/var_response,4) * 100
var_percent_explained <- glue("{var_prop_explained}%")
var_percent_unexplained <- glue("{var_prop_unexplained}%")Our ‘Variability’ table shows the variance in our response, fitted, and residual values respectively. In order to find how much of the variability in our reponses is explained by the model, we divide the variance of the fitted values by the variance of the response values.
Doing this math, we find the percent of variability explained by the model is 59.24%. This means our model is accounting for a moderate amount of variability. There is still 40.76% unexplained by the model, so other factors outside of years in school are influencing this response. Overall, for simple linear regression, our model is of moderate to high quality.
model_predict <- predict(lm_income_educ)
est_sigma <- sigma(lm_income_educ)
rand_error <- function(x, mean = 0, sd){
return(x + rnorm(length(x), mean, sd))
}
sim_response <- tibble(sim_income_pred = rand_error(x=model_predict,
sd=est_sigma))
combined_data <- regression_model_data |>
select(mean_educ, mean_inc, country) |>
bind_cols(sim_response)sim_reg_p <- combined_data |>
ggplot(aes(x = mean_educ,
y = sim_income_pred)) +
geom_point() +
geom_smooth(method=lm)+
labs(x = "Number of Years in School",
y = "Log10(Income)",
subtitle = "Income and years in school averaged \nacross country",
title = "Simulated Log Income based \non Regression Model" ) +
theme_bw()
obs_reg_p <- combined_data |>
ggplot(aes(x = mean_educ,
y = log10(mean_inc))) +
geom_point() +
geom_smooth(method = lm) +
labs(x = "Number of Years in School",
y = "Log10(Income)",
title = "Observed Log Income",
subtitle = "Income and years in school averaged \nacross country") +
theme_bw()
obs_reg_p + sim_reg_pOverall, the two plots are largely similar. Both the observed and simulated data follow positive, linear slopes with log-adjusted income values from about 0 to 2. The observed data plot appears to have slightly more variablity in log-adjusted income values, and its regression line seems to be steeper than that of the simulated data regression line.
# functions for this section were obtained from the textbook
# https://manncz.github.io/stat331-calpoly-text/10-predictive-checks.html#ch10-checkins
sims <- map_dfc(.x = 1:1000,
.f = ~ tibble(sim = rand_error(model_predict,
sd = est_sigma)
)
)
colnames(sims) <- colnames(sims) |>
str_replace(pattern = "\\.\\.\\.",
replace = "_")
sims2 <- regression_model_data |>
filter(!is.na(mean_inc),
!is.na(mean_educ)) |>
mutate(log_mean_inc = log10(mean_inc)) |>
select(log_mean_inc) |>
bind_cols(sims)
sim_r_sq <- sims2 |>
map(~ lm(log_mean_inc ~ .x, data = sims2)) |>
map(glance) |>
map_dbl(~ .x$r.squared)
sim_r_sq <- sim_r_sq[names(sim_r_sq) != "log_mean_inc"]tibble(sims = sim_r_sq) |>
ggplot(aes(x = sims)) +
geom_histogram(binwidth = 0.025) +
labs(x = expression("Simulated"~ R^2),
y = "",
subtitle = "Number of Simulated Models") +
theme_bw()avg_var <- round(mean(sim_r_sq),2)*100Our distribution of \(R^2\) values of simulated datasets is approximately normal, has values between 0.2 and 0.5, and is centered around 0.35. This suggests the data we simulated under this model have low to moderate similarity with our observed data. On average, our simulated data account for 35% of the variability in the observed log-adjusted income.
We explored the relationship between average daily income (in 2017 dollars) and average years in school for women ages 25-34 across 8 geographic regions, and averaged over 45 years. We performed a linear regression across all countries and log-transformed average income to linearize our dataset. Our model estimated a 24.71% increase in mean income for every additional year of education for women ages 25-54. Additionally, our linear model was able to explain 59.24% of the variance in our dataset, suggesting further predictors may better explain average income. When we generated a predictive model simulating our observed linear regression, we found it had a similar linear relationship and variability as our observed model. Our predictive checks ascribed a low to moderate relationship between our observed and predicted models. In conclusion, we found a moderate, positive correlation between number of years in school for women ages 25-34 and average daily income, which may benefit from exploring additional predictors of daily income.